Skip to content

feat(server): ContinuationCompat (SEP-2322 Option H suspend/resume)#2067

Closed
felixweinberger wants to merge 6 commits into
fweinberger/s3-requires-inputfrom
fweinberger/f4-continuation
Closed

feat(server): ContinuationCompat (SEP-2322 Option H suspend/resume)#2067
felixweinberger wants to merge 6 commits into
fweinberger/s3-requires-inputfrom
fweinberger/f4-continuation

Conversation

@felixweinberger
Copy link
Copy Markdown
Contributor


Adds ContinuationCompat: when a handler awaits ctx.mcpReq.send() on a stateless path, the live handler frame is parked, the current response carries IncompleteResult{requestState}, and a later POST with {requestState, inputResponses} resumes the same frame. TTL-bounded continuation store; resumption is bound to the original principal (auth token or session).

Motivation and Context

SEP-2322 Option H — lets existing await elicitInput() handler code work unchanged on the request/response transport, where there is no live backchannel to await on. Opt-in via shttpHandler({continuations}). Single-process by design.

How Has This Been Tested?

New continuationCompat.test.ts (suspend, resume, TTL expiry, principal-mismatch reject, concurrent-resume guard). Dual conformance both 40/40. pnpm test:all passes.

Breaking Changes

None. Opt-in.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Depends on S3, R3. The most security-sensitive new code in the stack — see review guide section "Controversial / pay attention". Review guide: https://gist.github.com/felixweinberger/5a48e0f14d5aced39ed6a91b61940711.


@felixweinberger felixweinberger added the v2-stateless 2026-06 SDK: Protocol decomposition + SEP alignment (request-first / stateless) label May 12, 2026
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 12, 2026

⚠️ No Changeset found

Latest commit: 72f15f7

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 12, 2026

Open in StackBlitz

@modelcontextprotocol/client

npm i https://pkg.pr.new/@modelcontextprotocol/client@2067

@modelcontextprotocol/server

npm i https://pkg.pr.new/@modelcontextprotocol/server@2067

@modelcontextprotocol/express

npm i https://pkg.pr.new/@modelcontextprotocol/express@2067

@modelcontextprotocol/fastify

npm i https://pkg.pr.new/@modelcontextprotocol/fastify@2067

@modelcontextprotocol/hono

npm i https://pkg.pr.new/@modelcontextprotocol/hono@2067

@modelcontextprotocol/node

npm i https://pkg.pr.new/@modelcontextprotocol/node@2067

commit: 72f15f7

@felixweinberger
Copy link
Copy Markdown
Contributor Author

@claude review

1 similar comment
@felixweinberger
Copy link
Copy Markdown
Contributor Author

@claude review

Comment thread packages/server/src/server/continuationCompat.ts
Comment on lines +391 to +395
return (req: Request, opts?: RequestOptions): Promise<Result> => {
if (opts?.signal?.aborted) {
const r = opts.signal.reason;
return Promise.reject(r instanceof Error ? r : new Error(String(r)));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 _suspendingSend reads opts?.signal but never opts?.timeout, so await ctx.mcpReq.send(req, {timeout: 5000}) rejects after 5s under BackchannelCompat (backchannelCompat.ts:79-83) but waits up to ttlMs (default 5 min) under ContinuationCompat. Given the class doc positions this as a drop-in that "wins when both are set", that's a silent behavioral divergence for identical handler code. A per-send timeout is arguably the wrong abstraction on a human-in-the-loop suspend path, so the minimal fix is a JSDoc note that RequestOptions.timeout (and resetTimeoutOnProgress/maxTotalTimeout) are superseded by ttlMs here — or honor an explicit opts.timeout if you want strict parity.

Extended reasoning...

What the gap is

RequestOptions.timeout is a documented contract on the public env.send / ctx.mcpReq.send surface (packages/core/src/shared/context.ts:78-82): "If exceeded, an SdkError with code RequestTimeout will be raised… If not specified, DEFAULT_REQUEST_TIMEOUT_MSEC will be used." The dispatcher's ctx.mcpReq.send wrapper does not apply the timeout itself — it forwards options straight through to the env.send sink, which is responsible for honoring it.

The two sinks diverge:

  • BackchannelCompat.makeEnvSend (backchannelCompat.ts:79-83) reads opts?.timeout ?? DEFAULT_REQUEST_TIMEOUT_MSEC and arms a setTimeout that rejects with SdkError(SdkErrorCode.RequestTimeout, …).
  • ContinuationCompat._suspendingSend (this PR, lines ~391-435) reads only opts?.signal (lines 392-394, 415-419, 422, 430). opts?.timeout, opts?.resetTimeoutOnProgress, and opts?.maxTotalTimeout are never referenced.

Step-by-step

  1. A handler calls await ctx.mcpReq.send({method: 'elicitation/create', …}, {timeout: 5000}).
  2. Dispatcher forwards to env.send(req, {timeout: 5000, signal: …}).
  3. Backchannel path: makeEnvSend arms setTimeout(…, 5000); if the client doesn't POST a response within 5s, the promise rejects with RequestTimeout and the handler's catch runs.
  4. Continuation path: _suspendingSend queues the request into the current batch, flushes an incomplete ask on the next microtask, and the returned promise settles only when (a) cont.nextAnswer() resolves on resume, (b) opts.signal aborts, or (c) the frame-level TTL fires via _armcont.fail(reason). With default ttlMs = 300_000, the handler waits up to 5 minutes instead of the requested 5 seconds.

Why nothing else covers it

The continuation-level TTL (_arm) is a frame-wide bound, not a per-send bound, and defaults to 60× the value the handler asked for. The opts.signal plumbing handles cancellation but not timeout. And because both the class JSDoc and the ShttpHandlerOptions.continuations doc say this path "takes precedence" / "wins when both are set", the PR is explicitly positioning ContinuationCompat as a drop-in for BackchannelCompat — so the same handler source silently changes semantics depending on which option the operator configured.

Impact and severity

This is a parity/contract-clarity gap, not a correctness break: nothing hangs forever (TTL bounds it), and the protocol flow is correct. There's also a defensible argument that honoring DEFAULT_REQUEST_TIMEOUT_MSEC (60s) on a human-in-the-loop suspend path would be wrong — the whole point of Option H is that the client may take human-scale time to answer. Hence nit.

Suggested fix

Minimal: add a sentence to the _suspendingSend JSDoc (and/or the class JSDoc) stating that RequestOptions.{timeout, resetTimeoutOnProgress, maxTotalTimeout} are not applied on this path and the frame-level ttlMs governs instead.

Alternatively, if you want strict parity for explicit timeouts only: when opts?.timeout !== undefined, arm a local setTimeout(() => reject(new SdkError(SdkErrorCode.RequestTimeout, …)), opts.timeout) alongside the existing signal listener and clear it in both settle.then branches. (Skipping the ?? DEFAULT_REQUEST_TIMEOUT_MSEC fallback avoids the 60s default firing on every elicitation.)

Comment on lines +283 to +298
} else if (incomingState === undefined) {
const owner = this._resolvePrincipal(env);
if (this._frames.size >= this._max) {
throw new Error(`ContinuationCompat at capacity (maxContinuations=${this._max})`);
}
if (owner !== undefined) {
const n = this._perPrincipalCount.get(owner) ?? 0;
if (n >= this._perPrincipalMax) {
throw new Error(`ContinuationCompat: principal at per-principal capacity (perPrincipalMax=${this._perPrincipalMax})`);
}
this._perPrincipalCount.set(owner, n + 1);
}
token = this._generate();
cont = new Continuation();
this._frames.set(token, { cont, timer: this._arm(token), owner, draining: true });
this._startRunner(onrequest, request, env, cont, token);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Frame insertion and per-principal quota accounting happen eagerly on every fresh request — before the handler has produced an incomplete ask — so a non-suspending tools/list or fast tools/call occupies one of maxContinuations and one of perPrincipalMax for its full in-flight duration. The JSDoc on maxContinuations, perPrincipalMax, and size all say "suspended frames", but the implementation actually counts in-flight or suspended requests; either defer registration until the first incomplete ask is pushed, or reword the docs to match.

Extended reasoning...

What the bug is

In _drive's fresh-request branch (incomingState === undefined), the per-principal counter is incremented, a requestState token is generated, a FrameEntry is inserted into _frames, and a TTL timer is armed — all before _startRunner kicks off the handler and before any env.send has been called. The slot is released only when the drain loop later sees 'done' or 'runnerError' and calls _delete(token). So every request routed through continuations.wrap(...) consumes one global slot and one per-principal slot for its entire in-flight duration, whether or not it ever suspends.

There is no leak: the done branch cleans up correctly, so steady-state size returns to 0 after a non-suspending handler finishes (the "handler that never calls env.send passes through unchanged" test confirms this). The issue is the transient occupancy and the doc/behavior mismatch.

Code path

shttpHandler wraps cb.onrequest unconditionally with continuations.wrap(...) whenever continuations is configured — there is no per-method opt-out, so ping, tools/list, prompts/list, and fast non-elicitating tools/call all flow through _drive. Inside _drive:

  1. incomingState === undefined → fresh-request branch.
  2. _resolvePrincipal(env) resolves owner.
  3. _perPrincipalCount.set(owner, n + 1) — quota consumed.
  4. token = this._generate() — UUID minted.
  5. _frames.set(token, { cont, timer: this._arm(token), owner, draining: true }) — global slot consumed, TTL timer armed.
  6. _startRunner(...) begins iterating the real handler.
  7. Handler yields a result and returns; runner pushes {kind: 'message'} then {kind: 'done'}.
  8. Drain loop sees done, calls _delete(token) → slot released.

Between steps 3–5 and step 8, the request holds a slot even though no frame is parked.

Why existing code doesn't prevent it

The capacity checks (_frames.size >= this._max, n >= this._perPrincipalMax) and the JSDoc all describe suspended frames — i.e., handlers parked at an await env.send(...). But because registration happens before the runner starts, the checks actually gate in-flight concurrency, not suspension count. Nothing in _drive distinguishes "will suspend" from "will complete synchronously"; that's only knowable once the first 'incomplete' ask appears on the channel.

Concrete walkthrough

With defaults (maxContinuations=1000, perPrincipalMax=ceil(1000/10)=100), suppose principal alice fires 101 concurrent tools/call requests for a tool that does I/O for 200ms but never calls env.send:

  • Requests 1–100 each enter the fresh branch, increment _perPrincipalCount['alice'] to 100, and start their runners.
  • Request 101 enters the fresh branch, finds n === 100 >= perPrincipalMax, and throws "principal at per-principal capacity"shttpHandler catches it and returns a 500.
  • A few hundred ms later all 100 runners push done, slots drain to 0 — but request 101 already failed.

Zero frames were ever actually parked. An operator who read the JSDoc ("Maximum suspended frames a single principal may hold") and set perPrincipalMax: 3 to bound pending elicitations would be surprised to find it rejecting the 4th concurrent non-elicitating request. There's also avoidable per-request overhead (crypto.randomUUID(), setTimeout, AbortController, two Channel allocations) for handlers that never suspend.

Impact and severity

This is a design/doc-accuracy issue rather than a correctness bug. The eager accounting is defensible as a more conservative DoS guard — in-flight coroutines also consume memory, and the runner's error path relies on this._frames.has(token) — and at default caps (1000/100) it's unlikely to bite in practice. But the three JSDoc strings (maxContinuations, perPrincipalMax, get size()) explicitly say "suspended", which the implementation does not honor. Filing as nit: worth fixing before merge but not blocking.

How to fix

Two options:

(a) Lazy registration — in the fresh branch, create the Continuation and start the runner, but don't increment _perPrincipalCount, generate a token, insert into _frames, or arm the TTL timer until the drain loop pulls the first 'incomplete' ask. If the runner reaches done/runnerError without ever producing incomplete, no slot was ever consumed. This also makes _resolvePrincipal's "refusing to suspend without a principal" error fire only when a suspend actually happens. Note the runner's if (this._frames.has(token)) guard would need adjusting (e.g., track liveness on the Continuation instead).

(b) Doc fix — keep the eager accounting and reword the JSDoc on maxContinuations, perPrincipalMax, and size from "suspended frames" to "in-flight or suspended frames routed through this store", so operators set caps with the right mental model.

Option (a) is the cleaner fix and also addresses the related concern that anonymous/principal resolution fires for requests that never suspend; option (b) is the minimal change if the conservative gating is intentional.

Comment on lines +146 to +153
*/
/**
* Validated identity to bind a frame to. Prefers the auth token (server-validated) over
* the legacy `mcp-session-id`. Never derived from `_meta` (client-asserted).
*/
function principalOf(env: RequestEnv): string | undefined {
return env.authInfo?.token ?? env.sessionId ?? (env.ext?.sessionId as string | undefined);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The class-level JSDoc block ("Opt-in suspend/resume continuation store...", lines 124-146) is not attached to export class ContinuationCompat — it is immediately followed by a second /** */ block + function principalOf, then interface FrameEntry, so TypeScript/TSDoc binds nothing to the exported class. Since this is a new public export with security-sensitive semantics (tenant isolation, single-process caveat, BackchannelCompat exclusivity), that documentation should surface in IDE hovers and .d.ts output. Move principalOf and FrameEntry above the JSDoc block, or move the block down to sit directly above the class declaration.

Extended reasoning...

What the bug is

The detailed JSDoc that documents ContinuationCompat's purpose, security model, single-process constraint, and mutual exclusivity with BackchannelCompat is orphaned — it is not associated with the export class ContinuationCompat declaration it describes. TypeScript and TSDoc tooling only attach the immediately preceding doc comment to a declaration, and here two intervening declarations break that adjacency.

The specific code path

In packages/server/src/server/continuationCompat.ts:

/**
 * Opt-in suspend/resume continuation store for shttpHandler...
 *   ...security model, single-process caveat, BackchannelCompat exclusivity...
 */                                            // ← lines 124-146, intended class doc
/**
 * Validated identity to bind a frame to...
 */
function principalOf(env: RequestEnv): ... {  // ← line ~151, consumes the second /** */
    ...
}

interface FrameEntry { ... }                   // ← lines ~155-167

export class ContinuationCompat {             // ← line ~169, no preceding /** */
    ...
}

The second /** */ block immediately follows the first, so the first block becomes a free-floating leading comment on principalOf (effectively dead in emit). FrameEntry then sits between principalOf and the class. By the time the compiler reaches export class ContinuationCompat, there is no doc comment directly above it.

Why existing code doesn't prevent it

Nothing in the build catches this: the file compiles fine, lint rules don't flag stacked JSDoc blocks, and tests cover runtime behaviour only. The PR adds export { ContinuationCompat } to packages/server/src/index.ts, so the symbol is part of the package's public surface, but .d.ts emission will produce export declare class ContinuationCompat { ... } with no leading comment.

Step-by-step proof

  1. tsc --declaration walks declarations and copies the immediately adjacent leading JSDoc into the .d.ts for each exported symbol.
  2. For principalOf, the immediately adjacent JSDoc is the small "Validated identity..." block — that gets attached (but principalOf is module-private, so it never reaches .d.ts anyway).
  3. The large "Opt-in suspend/resume..." block is not immediately adjacent to any declaration (another /** */ follows it), so it is treated as an orphan and dropped from emit.
  4. interface FrameEntry has no JSDoc above it (the preceding token is } closing principalOf).
  5. export class ContinuationCompat has no JSDoc above it (the preceding token is } closing FrameEntry).
  6. Result: hovering ContinuationCompat in an IDE, or reading the published .d.ts, shows no documentation. The carefully-written tenant-isolation rules, the "single-process only" warning, and the allowAnonymousSuspend security note are invisible to consumers.

Impact

Documentation-only — no runtime effect. But the PR description itself flags this as "the most security-sensitive new code in the stack", and the lost prose is exactly the part that tells deployers (a) frames are in-memory and need sticky routing, (b) anonymous suspend is unsafe, and (c) configuring this alongside BackchannelCompat has a defined precedence. Hiding that from API docs/IDE tooling undermines the point of writing it.

How to fix

Either move function principalOf and interface FrameEntry above the large JSDoc block, or relocate that JSDoc so it sits directly on top of export class ContinuationCompat. The latter is the smaller diff:

function principalOf(...) { ... }
interface FrameEntry { ... }

/**
 * Opt-in suspend/resume continuation store for shttpHandler...
 */
export class ContinuationCompat { ... }

@felixweinberger felixweinberger force-pushed the fweinberger/s3-requires-input branch from d7d52d9 to ef0c0ce Compare May 12, 2026 14:38
@felixweinberger felixweinberger force-pushed the fweinberger/f4-continuation branch from c316106 to da01f10 Compare May 12, 2026 14:39
@felixweinberger
Copy link
Copy Markdown
Contributor Author

@claude review

Comment thread packages/server/src/server/continuationCompat.ts
Comment on lines +45 to +55
/**
* If `false` (default), {@linkcode ContinuationCompat.drive} throws when no principal can
* be derived (no `authInfo.token` and no `mcp-session-id`). Anonymous suspension means any
* caller can resume any frame; only enable this for trusted single-tenant deployments.
*/
allowAnonymousSuspend?: boolean;
/**
* Maximum suspended frames a single principal may hold. New suspensions beyond this throw.
* Prevents one tenant exhausting `maxContinuations`. Defaults to `Math.ceil(maxContinuations / 10)`.
*/
perPrincipalMax?: number;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 🟡 The fixup commit da01f10 changed the fresh-request guards from throw to yield this._capacityError(...) and updated the maxContinuations JSDoc to say "yield a JSON-RPC -32000 error response" — but the sibling allowAnonymousSuspend doc (line 46) still says "{@linkcode ContinuationCompat.drive} throws when no principal can be derived" and perPrincipalMax (line 52) still says "New suspensions beyond this throw". Both should now read "yield a JSON-RPC -32000 error response" to match the implementation (and ContinuationCompat.drive is not a public symbol — the wrapped onrequest is what yields). ContinuationCompatOptions is exported, so this stale prose ships in .d.ts.

Extended reasoning...

What the bug is

Commit da01f10 ("yield JSON-RPC error (not throw) for principal/capacity guards so SSE clients receive it") fixed the runtime issue from review comment #3226706458 by replacing the three fresh-request throw statements in _drive with yield this._capacityError(request.id, ...) + return. The maxContinuations JSDoc was correctly updated in that commit to read:

New suspensions beyond this cap yield a JSON-RPC -32000 error response for the request.

But the two sibling option docs in the same ContinuationCompatOptions interface were missed:

  • allowAnonymousSuspend (lines 45-49): "If false (default), {@linkcode ContinuationCompat.drive} throws when no principal can be derived..."
  • perPrincipalMax (lines 51-54): "Maximum suspended frames a single principal may hold. New suspensions beyond this throw."

Both now describe behavior the code no longer ships. There's also a dangling {@linkcode ContinuationCompat.drive} reference — drive is not a public method (the implementation is private _drive, and consumers only ever see the wrapped onrequest returned from wrap()), so the linkcode resolves to nothing in TSDoc tooling.

Why existing code doesn't prevent it

This is a partial migration: da01f10 touched all three runtime guards but only one of the three corresponding doc strings. Nothing in the build flags doc/code drift, and the tests (which were also updated in da01f10 to assert error: { code: -32_000 } at status: 200) only exercise runtime behavior. The internal inconsistency — one option says "yield a JSON-RPC -32000 error", the next two say "throw" — makes it clearly an oversight in the fixup rather than an intentional distinction.

Step-by-step proof

  1. packages/server/src/index.ts exports ContinuationCompatOptions as public API, so its JSDoc is emitted into .d.ts and surfaces in IDE hovers.
  2. Read allowAnonymousSuspend doc → it promises a throw when no principal is derivable.
  3. Trace the implementation: _drive fresh-request branch → owner === undefined && !this._allowAnonymousyield this._capacityError(request.id, 'ContinuationCompat: refusing to suspend without a principal ...'); return;. No throw.
  4. Confirm via the test "refuses to suspend without a principal by default": it asserts r.status === 200 and error: { code: -32_000 } — i.e., a yielded JSON-RPC error, not a thrown 500.
  5. Same for perPrincipalMax: doc says "throw"; code at n >= this._perPrincipalMax does yield this._capacityError(...); test "rejects new suspension when principal is at perPrincipalMax" asserts status: 200 + error: { code: -32_000 }.
  6. Compare with maxContinuations doc on line 30, which was updated to "yield a JSON-RPC -32000 error response" — proving the intent and that the other two were simply missed.

Impact

Documentation-only — no runtime effect. But ContinuationCompatOptions is part of the package's public surface, so the "throws" claim reaches consumers via .d.ts and IDE hovers. An operator reading the doc would expect allowAnonymousSuspend: false to surface as a thrown error / 500 (and might wire onerror or HTTP-level error handling accordingly), when in practice the client receives a 200 with a JSON-RPC -32000 body. The {@linkcode ContinuationCompat.drive} reference also points at a nonexistent public symbol and will render as a dead link in TSDoc output.

How to fix

Reword both to match maxContinuations:

/**
 * If `false` (default), requests with no derivable principal (no `authInfo.token`
 * and no `mcp-session-id`) yield a JSON-RPC `-32000` error response. Anonymous
 * suspension means any caller can resume any frame; only enable this for trusted
 * single-tenant deployments.
 */
allowAnonymousSuspend?: boolean;
/**
 * Maximum suspended frames a single principal may hold. New suspensions beyond this
 * yield a JSON-RPC `-32000` error response. Prevents one tenant exhausting
 * `maxContinuations`. Defaults to `Math.ceil(maxContinuations / 10)`.
 */
perPrincipalMax?: number;

Filing as a nit since it's prose-only with no runtime impact, but it should land in the same PR — the stale text was introduced by this PR's own fixup commit, and the three options now contradict each other in the public type definition.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the typedoc-link fix and the allowAnonymousSuspend rewording — but perPrincipalMax (line 52) still reads "New suspensions beyond this throw." It should match maxContinuations and say "yield a JSON-RPC -32000 error response" (the implementation does yield this._capacityError(...), and the test asserts status: 200 + error: { code: -32_000 }).

Minor: for allowAnonymousSuspend, "rejects" is better than "throws" but still slightly ambiguous (could be read as promise rejection) — consider matching the maxContinuations phrasing here too so all three options describe the same behavior the same way.

@felixweinberger felixweinberger force-pushed the fweinberger/s3-requires-input branch from ef0c0ce to c1b00ac Compare May 12, 2026 15:57
@felixweinberger felixweinberger force-pushed the fweinberger/f4-continuation branch from da01f10 to c31540e Compare May 12, 2026 15:57
@felixweinberger
Copy link
Copy Markdown
Contributor Author

@claude review

Comment thread packages/server/src/server/continuationCompat.ts Outdated
Comment on lines +68 to +75
/**
* SEP-2322 Option H suspend/resume continuation store. When provided, a handler's
* `await ctx.mcpReq.send(...)` parks the live frame and the current response carries
* an `IncompleteResult{requestState}`; a later POST with `params.{requestState,
* inputResponses}` resumes the same frame. Single-process only. When both this and
* `backchannel` are set, this takes precedence (it supplies `env.send` first).
*/
continuations?: ContinuationCompat;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 nit: handleHttp's function-level JSDoc (handleHttp.ts:35-39) calls out options.session and options.backchannel as the way to make ctx.mcpReq.send work, but doesn't mention the new options.continuations — which is the sibling path for the same capability and is documented here as taking precedence over backchannel. Worth a one-line addition to that paragraph so the public entry point doesn't describe only one of two mutually-exclusive ways to enable server-to-client requests.

Extended reasoning...

What the gap is

handleHttp is the public entry point (shttpHandler is @internal per its own JSDoc), and HandleHttpOptions is a straight re-export of ShttpHandlerOptions (handleHttp.ts:54), so options.continuations added in this PR is now part of handleHttp's public surface. The function-level JSDoc on handleHttp (handleHttp.ts:19-40) explicitly enumerates the compat helpers in prose:

Supply a SessionCompat via options.session to serve clients that send Mcp-Session-Id …, and a BackchannelCompat via options.backchannel to let handlers' ctx.mcpReq.send (e.g. elicitInput, requestSampling) reach those clients over the open POST SSE stream.

It does not mention options.continuations / ContinuationCompat, which this PR adds as the third compat helper and the alternative path for exactly the same ctx.mcpReq.send capability.

Why this is worth fixing despite the doc not being exhaustive

It's true that handleHttp's JSDoc is a usage overview rather than an option reference — it already omits enableJsonResponse, eventStore, retryInterval, supportedProtocolVersions, and onerror, and the per-option documentation lives on ShttpHandlerOptions where IDE hover picks it up. So the omission of continuations is not hiding the option from consumers.

But the omitted options are orthogonal infrastructure knobs. backchannel and continuations are different: they are siblings for the same feature (both supply env.send so ctx.mcpReq.send resolves), they are mutually exclusive (the option JSDoc here says "When both this and backchannel are set, this takes precedence"), and the handleHttp prose specifically singles out backchannel as the way to make ctx.mcpReq.send work. After this PR that sentence is no longer the whole story — a reader of the public entry point would reasonably conclude backchannel is the only option, when in fact the SEP-2322 path is the forward-looking one.

The PR is internally inconsistent on this: ContinuationCompat's own class JSDoc says it is "for {@linkcode handleHttp}" and wrap()'s JSDoc says "Called by {@linkcode handleHttp} when this instance is configured", and the prior stack PRs each added their compat helper (session, backchannel) to exactly this paragraph. This one should follow suit.

Step-by-step

  1. packages/server/src/index.ts exports handleHttp and ContinuationCompat as public API; shttpHandler is not exported (it's @internal).
  2. A consumer reads handleHttp's JSDoc to learn how to enable elicitInput/requestSampling in handlers. Lines 37-39 say: pass options.backchannel.
  3. Nothing in that prose hints that options.continuations exists or that it is the SEP-2322 alternative for stateless deployments — even though ShttpHandlerOptions.continuations (this hunk, lines 68-75) and the ContinuationCompat class doc both describe it as the handleHttp integration and document the precedence over backchannel.
  4. The consumer would have to discover continuations by expanding HandleHttpOptions in their IDE, rather than from the canonical entry-point prose that already lists its sibling.

Impact

Documentation-only; no runtime effect. The option is fully documented at the type level, so this is purely an asymmetry in the function-level prose. Filing as nit.

Suggested fix

Append one clause to handleHttp.ts:37-39, e.g.:

 * and a `BackchannelCompat` via `options.backchannel` to let handlers'
 * `ctx.mcpReq.send` (e.g. `elicitInput`, `requestSampling`) reach those
 * clients over the open POST SSE stream  or a `ContinuationCompat` via
 * `options.continuations` for the SEP-2322 suspend/resume path on stateless
 * transports (takes precedence over `backchannel` when both are set).

(Note: the edit is in handleHttp.ts, which this PR doesn't touch; the comment is anchored on the ShttpHandlerOptions.continuations hunk since that's where the option enters the diff.)

Opt-in stateful path for handleHttp: a handler's await ctx.mcpReq.send(...) parks the live frame; current response carries IncompleteResult{requestState}; a later POST with {requestState, inputResponses} resumes the same frame.

Single-process only. Frames bound to the validated principal (authInfo.token or sessionId); anonymous suspension refused by default. Per-principal frame cap prevents one tenant exhausting maxContinuations.
…al/capacity guards so SSE clients receive it
…ed rejection on early close); fix typedoc links
…al; bridge per-round HTTP abort during each drain
@felixweinberger felixweinberger force-pushed the fweinberger/s3-requires-input branch from c1b00ac to e579bff Compare May 12, 2026 17:28
@felixweinberger felixweinberger force-pushed the fweinberger/f4-continuation branch from c31540e to bd3442e Compare May 12, 2026 17:29
@felixweinberger
Copy link
Copy Markdown
Contributor Author

@claude review

Comment thread packages/server/src/server/continuationCompat.ts Outdated
Comment on lines +66 to +70
describe('ContinuationCompat — suspend/resume via shttpHandler', () => {
let continuations: ContinuationCompat;
afterEach(() => {
continuations?.close();
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The per-round env.signalcont.abort bridge added in bd3442e (continuationCompat.ts:323-334, 371) has no test coverage — grepping this file for cancelled|abort|signal returns zero matches. Since that bridge was added specifically to fix the resume-round-cancellation bug from review #3228252344 and is non-trivial lifecycle code (listener attach/detach across try/finally, already-aborted fast path), it would be worth a regression test: suspend, resume with a slow handler, POST notifications/cancelled for the resume id, and assert continuations.size === 0. REVIEW.md asks for vitest coverage of error paths on new behavior.

Extended reasoning...

What the gap is

Commit bd3442e introduced the per-round abort bridge in _drive (continuationCompat.ts:323-334 plus the removeEventListener in the finally at line 371) so that notifications/cancelled or an SSE disconnect on a resume round actually reaches the long-lived runner via cont.abort. This was the fix for the 🔴 review comment #3228252344, where round-2's env.signal was previously never observed and a resumed handler ran until ttlMs with no way to cancel it.

continuationCompat.test.ts does not exercise this path. Reading the file end-to-end, the eleven tests cover: round-trip suspend/resume, SSE-mode notifications, batched concurrent env.send, principal-mismatch reject, unknown requestState, missing inputResponses slot, concurrent-resume guard, TTL eviction, anonymous-suspend reject, perPrincipalMax, and the non-suspending pass-through. A grep of the file for cancelled, abort, signal, or cancel (case-insensitive) returns zero matches — none of the tests POST a notifications/cancelled, abort an AbortController, or close an SSE stream mid-drain.

Why this matters

The bridge is exactly the kind of async/lifecycle code that regresses silently:

  • It pairs addEventListener('abort', onAbort, { once: true }) with a removeEventListener in a finally, so a refactor that returns early (or moves the finally) could leave the listener attached and kill a parked frame on a late abort — the very behavior the comment at line 327 says it must avoid.
  • It has an already-aborted fast path (if (env.signal?.aborted) cont.abort.abort(...)) that is easy to drop in a future cleanup.
  • It is the only link between shttpHandler's per-request ctrl.signal and the frame-lifetime cont.abort.signal; removing it reverts to the pre-bd3442e behavior with no test failing.

REVIEW.md's checklist explicitly requires "New behavior has vitest coverage including error paths", and the PR description self-identifies this file as "the most security-sensitive new code in the stack". Shipping the most recent fixup commit with zero regression protection means a future refactor could silently re-break resume-round cancellation.

Step-by-step proof of the gap

  1. git show bd3442e -- packages/server/src/server/continuationCompat.ts adds the onAbort listener block (lines 323-334) and the env.signal?.removeEventListener('abort', onAbort) in the drain-loop finally (line 371), and rewrites _startRunner to pass only cont.abort.signal (line 387).
  2. git show bd3442e -- packages/server/test/server/continuationCompat.test.ts shows no hunk — the commit touched only the source file.
  3. rg -in 'cancelled|abort|signal|cancel' packages/server/test/server/continuationCompat.test.ts → no output.
  4. Therefore, deleting lines 323-334 and 371 (reverting to the pre-fix state where the resume round's env.signal is never read) would leave pnpm vitest run continuationCompat green. The original bug — notifications/cancelled for a resume id being silently dropped — could reappear with no failing test.

Suggested test

Something along these lines would lock the fix in:

it('notifications/cancelled on a resume round aborts the frame', async () => {
    continuations = new ContinuationCompat({ allowAnonymousSuspend: true });
    let release!: () => void;
    const gate = new Promise<void>(r => (release = r));
    const cb: ShttpCallbacks = {
        async *onrequest(req, env) {
            await env!.send!({ method: 'elicitation/create' });
            await gate;                       // long post-resume work
            yield { jsonrpc: '2.0', id: req.id, result: { ok: true } };
        }
    };
    const handler = shttpHandler(cb, { continuations, enableJsonResponse: true });

    const r1 = await handler(post({ jsonrpc: '2.0', id: 1, method: 'x' }));
    const inc = asIncomplete((await r1.json()) as JSONRPCMessage);

    const resumeP = handler(post({
        jsonrpc: '2.0', id: 2, method: 'x',
        params: { requestState: inc.requestState, inputResponses: { r0: {} } }
    }));
    await Promise.resolve();                  // let the resume drain start

    await handler(post({ jsonrpc: '2.0', method: 'notifications/cancelled', params: { requestId: 2 } }));

    const m2 = (await (await resumeP).json()) as JSONRPCMessage;
    expect(m2).toMatchObject({ id: 2, error: { code: -32603 } });
    expect(continuations.size).toBe(0);
    release();
});

A second variant that asserts the listener is detached (suspend, return on incomplete, then abort the round-1 controller and verify continuations.size === 1 — frame still parked) would cover the finally half.

Severity

Nit — this is a coverage gap, not a runtime defect; the bridge itself is correct. But it is a fair ask given the bridge was a bugfix in code the PR flags as security-sensitive, and the repo's own review checklist requires error-path coverage.

…efensive try/catch in _suspendingSend onFulfilled (primitive value caused unhandled rejection)
@felixweinberger
Copy link
Copy Markdown
Contributor Author

@claude review

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — 72f15f7 addresses the inputResponses type-guard / unhandled-rejection issue and I have nothing further blocking; the remaining open items are the 🟡 nits already filed inline (orphaned class JSDoc, perPrincipalMax doc still says "throw", handleHttp prose, abort-bridge test coverage, eager quota accounting). Given this is self-described as the most security-sensitive piece of the stack, deferring to a human for the design/security sign-off.

Extended reasoning...

Overview

Adds ContinuationCompat (~500 new lines in continuationCompat.ts), a suspend/resume continuation store implementing SEP-2322 Option H. Wires it into shttpHandler via a new continuations option, exports ContinuationCompat + ContinuationCompatOptions from the package root, and adds ~390 lines of vitest coverage. The mechanism parks a live handler frame in-memory when it awaits env.send(...), returns an IncompleteResult{requestState}, and resumes the same frame on a later POST.

Security risks

This is explicitly the most security-sensitive PR in the stack per the author's own description. It introduces: (a) tenant isolation via principal binding (authInfo.tokensessionId fallback) — a bug here lets one caller resume another's frame; (b) an in-memory store of live coroutine frames keyed by an opaque requestState token whose unguessability is the access-control boundary when allowAnonymousSuspend is on; (c) DoS surface via maxContinuations/perPrincipalMax caps and TTL; (d) untrusted client input (params.inputResponses) flowing into the resumed handler. Over five review rounds I flagged and the author fixed: SSE-mode error swallowing, an unhandled-rejection on early close(), a per-round abort-signal lifecycle bug, and a primitive-inputResponses crash. The current revision looks defensible on those, but the principal-derivation precedence, the allowAnonymousSuspend escape hatch, and the single-process constraint all warrant human sign-off.

Level of scrutiny

High. This is a new public API export, new async/lifecycle machinery (two hand-rolled channels, an abort bridge with attach/detach across try/finally, microtask-batched flushing), and it directly handles auth-derived principals. It also has no changeset yet. Per REVIEW.md this hits "API surface — every new export is intentional", "Async: race conditions, cleanup on cancellation, unhandled rejections", and "works for both stdio and Streamable HTTP" — none of which a bot should be the sole reviewer on.

Other factors

The author has been very responsive — every 🔴 I raised was fixed within the hour. Remaining open items are all 🟡 nits I've already posted inline and won't repeat here. Test coverage is reasonable (12 cases) but still missing a regression test for the bd3442e abort bridge. The handleHttp entry-point JSDoc and the orphaned class-level JSDoc are doc-only. None of these block, but a maintainer should decide whether they gate merge. This PR also depends on S3/R3 in the stack and references an external review guide gist — a human should confirm it matches the agreed Option H design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v2-stateless 2026-06 SDK: Protocol decomposition + SEP alignment (request-first / stateless)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant